Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving

Author

  • Daniel Esser
Abstract

Automatic information extraction from scanned business documents is especially valuable in the application domain of document management and archiving. Although current solutions for document classification and extraction work reasonably well, they still require considerable on-site configuration effort from domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often avoid such systems because of the configuration they need and the long training periods required to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that fits the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and to manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system. A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.

Similar resources

Few-exemplar Information Extraction for Business Documents

The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in the application domain of document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work pretty well, they still require a high effort of on-site configuration done by domain experts ...

Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies

Business document indexing for ordered filing of documents is a crucial task for every company. Since this is tedious, error-prone work, automatic or at least semi-automatic approaches have a high value. One approach for semi-automated indexing of business documents uses self-learning information extraction methods based on user feedback. While these methods require no management of complex in...

A New Text-Mining Method for Extracting User Context Information to Improve Search Engine Result Ranking

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...

Automatic indexing of scanned documents: a layout-based approach

Archiving official written documents such as invoices, reminders and account statements is becoming increasingly important in both business and private settings. Creating appropriate index entries for document archives, such as sender’s name, creation date or document number, is tedious manual work. We present a novel approach to handle automatic indexing of documents based on generic positional extraction of

Simple Method for Ontology Automatic Extraction from Documents

There are many situations in which it is necessary to represent and analyze the concepts that describe a document or a collection of documents. One such situation is information retrieval, which is becoming more complex with the growing number and variety of document types. One way to represent the concepts is through a formal structure using ontologies. Thus, this article presents a fast and s...

Publication date: 2013